Insurance Fraud Prediction

Table of Content

Step 0: Introduction

Insurance fraud is a deliberately false or misrepresented claim by an insured/claimant or entity for the purpose of financial gain.It is one of the largest and most well-known problems that insurers face. Fraudulent claims can be highly expensive for each insurer. Therefore, it is important to know which claims are correct and which are not. It is not doable for insurance companies to check all claims personally since this will cost simply too much time and money. Fraud can be committed at different touchpoints in the insurance lifecycle by insured applicants, policyholders, third-party claimants or professionals such as insurance agencies/agents who provide such services.

The goal of this project is to build a model that can detect auto insurance fraud.

Step 1: Dataset

The largest asset which insurers have in the fight against fraud is data. The raw data has 40 variables including the target ‘Fraud Reported’. Some of the variables include names like policy number, policy bind date, policy annual premium, incident severity, incident location, auto model.

Step 2: Needed libraries

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score,precision_score,recall_score, classification_report, confusion_matrix
import warnings
warnings.simplefilter("ignore")

Step 3: Exploratory Data Analysis (EDA)

In [2]:
# Reading data

df = pd.read_csv('insurance_claims.csv')
In [3]:
pd.set_option('display.max_columns', None)
In [4]:
# Five row of data

df.head()
Out[4]:
months_as_customer age policy_number policy_bind_date policy_state policy_csl policy_deductable policy_annual_premium umbrella_limit insured_zip insured_sex insured_education_level insured_occupation insured_hobbies insured_relationship capital-gains capital-loss incident_date incident_type collision_type incident_severity authorities_contacted incident_state incident_city incident_location incident_hour_of_the_day number_of_vehicles_involved property_damage bodily_injuries witnesses police_report_available total_claim_amount injury_claim property_claim vehicle_claim auto_make auto_model auto_year fraud_reported _c39
0 328 48 521585 2014-10-17 OH 250/500 1000 1406.91 0 466132 MALE MD craft-repair sleeping husband 53300 0 2015-01-25 Single Vehicle Collision Side Collision Major Damage Police SC Columbus 9935 4th Drive 5 1 YES 1 2 YES 71610 6510 13020 52080 Saab 92x 2004 Y NaN
1 228 42 342868 2006-06-27 IN 250/500 2000 1197.22 5000000 468176 MALE MD machine-op-inspct reading other-relative 0 0 2015-01-21 Vehicle Theft ? Minor Damage Police VA Riverwood 6608 MLK Hwy 8 1 ? 0 0 ? 5070 780 780 3510 Mercedes E400 2007 Y NaN
2 134 29 687698 2000-09-06 OH 100/300 2000 1413.14 5000000 430632 FEMALE PhD sales board-games own-child 35100 0 2015-02-22 Multi-vehicle Collision Rear Collision Minor Damage Police NY Columbus 7121 Francis Lane 7 3 NO 2 3 NO 34650 7700 3850 23100 Dodge RAM 2007 N NaN
3 256 41 227811 1990-05-25 IL 250/500 2000 1415.74 6000000 608117 FEMALE PhD armed-forces board-games unmarried 48900 -62400 2015-01-10 Single Vehicle Collision Front Collision Major Damage Police OH Arlington 6956 Maple Drive 5 1 ? 1 2 NO 63400 6340 6340 50720 Chevrolet Tahoe 2014 Y NaN
4 228 44 367455 2014-06-06 IL 500/1000 1000 1583.91 6000000 610706 MALE Associate sales board-games unmarried 66000 -46000 2015-02-17 Vehicle Theft ? Minor Damage None NY Arlington 3041 3rd Ave 20 1 NO 0 1 NO 6500 1300 650 4550 Accura RSX 2009 N NaN
In [5]:
# Last five rows of data

df.tail()
Out[5]:
months_as_customer age policy_number policy_bind_date policy_state policy_csl policy_deductable policy_annual_premium umbrella_limit insured_zip insured_sex insured_education_level insured_occupation insured_hobbies insured_relationship capital-gains capital-loss incident_date incident_type collision_type incident_severity authorities_contacted incident_state incident_city incident_location incident_hour_of_the_day number_of_vehicles_involved property_damage bodily_injuries witnesses police_report_available total_claim_amount injury_claim property_claim vehicle_claim auto_make auto_model auto_year fraud_reported _c39
995 3 38 941851 1991-07-16 OH 500/1000 1000 1310.80 0 431289 FEMALE Masters craft-repair paintball unmarried 0 0 2015-02-22 Single Vehicle Collision Front Collision Minor Damage Fire NC Northbrook 6045 Andromedia St 20 1 YES 0 1 ? 87200 17440 8720 61040 Honda Accord 2006 N NaN
996 285 41 186934 2014-01-05 IL 100/300 1000 1436.79 0 608177 FEMALE PhD prof-specialty sleeping wife 70900 0 2015-01-24 Single Vehicle Collision Rear Collision Major Damage Fire SC Northbend 3092 Texas Drive 23 1 YES 2 3 ? 108480 18080 18080 72320 Volkswagen Passat 2015 N NaN
997 130 34 918516 2003-02-17 OH 250/500 500 1383.49 3000000 442797 FEMALE Masters armed-forces bungie-jumping other-relative 35100 0 2015-01-23 Multi-vehicle Collision Side Collision Minor Damage Police NC Arlington 7629 5th St 4 3 ? 2 3 YES 67500 7500 7500 52500 Suburu Impreza 1996 N NaN
998 458 62 533940 2011-11-18 IL 500/1000 2000 1356.92 5000000 441714 MALE Associate handlers-cleaners base-jumping wife 0 0 2015-02-26 Single Vehicle Collision Rear Collision Major Damage Other NY Arlington 6128 Elm Lane 2 1 ? 0 1 YES 46980 5220 5220 36540 Audi A5 1998 N NaN
999 456 60 556080 1996-11-11 OH 250/500 1000 766.19 0 612260 FEMALE Associate sales kayaking husband 0 0 2015-02-26 Parked Car ? Minor Damage Police WV Columbus 1416 Cherokee Ridge 6 1 ? 0 3 ? 5060 460 920 3680 Mercedes E400 2007 N NaN
In [6]:
# Shape of data

print("The dataset has {} rows and {} columns. \n" .format(df.shape[0], df.shape[1]))
The dataset has 1000 rows and 40 columns. 

In [7]:
# Datatypes

df.dtypes
Out[7]:
months_as_customer               int64
age                              int64
policy_number                    int64
policy_bind_date                object
policy_state                    object
policy_csl                      object
policy_deductable                int64
policy_annual_premium          float64
umbrella_limit                   int64
insured_zip                      int64
insured_sex                     object
insured_education_level         object
insured_occupation              object
insured_hobbies                 object
insured_relationship            object
capital-gains                    int64
capital-loss                     int64
incident_date                   object
incident_type                   object
collision_type                  object
incident_severity               object
authorities_contacted           object
incident_state                  object
incident_city                   object
incident_location               object
incident_hour_of_the_day         int64
number_of_vehicles_involved      int64
property_damage                 object
bodily_injuries                  int64
witnesses                        int64
police_report_available         object
total_claim_amount               int64
injury_claim                     int64
property_claim                   int64
vehicle_claim                    int64
auto_make                       object
auto_model                      object
auto_year                        int64
fraud_reported                  object
_c39                           float64
dtype: object
In [8]:
# Non null values, data type

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 40 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   months_as_customer           1000 non-null   int64  
 1   age                          1000 non-null   int64  
 2   policy_number                1000 non-null   int64  
 3   policy_bind_date             1000 non-null   object 
 4   policy_state                 1000 non-null   object 
 5   policy_csl                   1000 non-null   object 
 6   policy_deductable            1000 non-null   int64  
 7   policy_annual_premium        1000 non-null   float64
 8   umbrella_limit               1000 non-null   int64  
 9   insured_zip                  1000 non-null   int64  
 10  insured_sex                  1000 non-null   object 
 11  insured_education_level      1000 non-null   object 
 12  insured_occupation           1000 non-null   object 
 13  insured_hobbies              1000 non-null   object 
 14  insured_relationship         1000 non-null   object 
 15  capital-gains                1000 non-null   int64  
 16  capital-loss                 1000 non-null   int64  
 17  incident_date                1000 non-null   object 
 18  incident_type                1000 non-null   object 
 19  collision_type               1000 non-null   object 
 20  incident_severity            1000 non-null   object 
 21  authorities_contacted        1000 non-null   object 
 22  incident_state               1000 non-null   object 
 23  incident_city                1000 non-null   object 
 24  incident_location            1000 non-null   object 
 25  incident_hour_of_the_day     1000 non-null   int64  
 26  number_of_vehicles_involved  1000 non-null   int64  
 27  property_damage              1000 non-null   object 
 28  bodily_injuries              1000 non-null   int64  
 29  witnesses                    1000 non-null   int64  
 30  police_report_available      1000 non-null   object 
 31  total_claim_amount           1000 non-null   int64  
 32  injury_claim                 1000 non-null   int64  
 33  property_claim               1000 non-null   int64  
 34  vehicle_claim                1000 non-null   int64  
 35  auto_make                    1000 non-null   object 
 36  auto_model                   1000 non-null   object 
 37  auto_year                    1000 non-null   int64  
 38  fraud_reported               1000 non-null   object 
 39  _c39                         0 non-null      float64
dtypes: float64(2), int64(17), object(21)
memory usage: 312.6+ KB
In [9]:
# describe- statistical description of the numerical variables in our dataset

df.describe()
Out[9]:
months_as_customer age policy_number policy_deductable policy_annual_premium umbrella_limit insured_zip capital-gains capital-loss incident_hour_of_the_day number_of_vehicles_involved bodily_injuries witnesses total_claim_amount injury_claim property_claim vehicle_claim auto_year _c39
count 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1.000000e+03 1000.000000 1000.000000 1000.000000 1000.000000 1000.00000 1000.000000 1000.000000 1000.00000 1000.000000 1000.000000 1000.000000 1000.000000 0.0
mean 203.954000 38.948000 546238.648000 1136.000000 1256.406150 1.101000e+06 501214.488000 25126.100000 -26793.700000 11.644000 1.83900 0.992000 1.487000 52761.94000 7433.420000 7399.570000 37928.950000 2005.103000 NaN
std 115.113174 9.140287 257063.005276 611.864673 244.167395 2.297407e+06 71701.610941 27872.187708 28104.096686 6.951373 1.01888 0.820127 1.111335 26401.53319 4880.951853 4824.726179 18886.252893 6.015861 NaN
min 0.000000 19.000000 100804.000000 500.000000 433.330000 -1.000000e+06 430104.000000 0.000000 -111100.000000 0.000000 1.00000 0.000000 0.000000 100.00000 0.000000 0.000000 70.000000 1995.000000 NaN
25% 115.750000 32.000000 335980.250000 500.000000 1089.607500 0.000000e+00 448404.500000 0.000000 -51500.000000 6.000000 1.00000 0.000000 1.000000 41812.50000 4295.000000 4445.000000 30292.500000 2000.000000 NaN
50% 199.500000 38.000000 533135.000000 1000.000000 1257.200000 0.000000e+00 466445.500000 0.000000 -23250.000000 12.000000 1.00000 1.000000 1.000000 58055.00000 6775.000000 6750.000000 42100.000000 2005.000000 NaN
75% 276.250000 44.000000 759099.750000 2000.000000 1415.695000 0.000000e+00 603251.000000 51025.000000 0.000000 17.000000 3.00000 2.000000 2.000000 70592.50000 11305.000000 10885.000000 50822.500000 2010.000000 NaN
max 479.000000 64.000000 999435.000000 2000.000000 2047.590000 1.000000e+07 620962.000000 100500.000000 0.000000 23.000000 4.00000 2.000000 3.000000 114920.00000 21450.000000 23670.000000 79560.000000 2015.000000 NaN
In [10]:
# Check for missing values

df.isnull().sum()
Out[10]:
months_as_customer                0
age                               0
policy_number                     0
policy_bind_date                  0
policy_state                      0
policy_csl                        0
policy_deductable                 0
policy_annual_premium             0
umbrella_limit                    0
insured_zip                       0
insured_sex                       0
insured_education_level           0
insured_occupation                0
insured_hobbies                   0
insured_relationship              0
capital-gains                     0
capital-loss                      0
incident_date                     0
incident_type                     0
collision_type                    0
incident_severity                 0
authorities_contacted             0
incident_state                    0
incident_city                     0
incident_location                 0
incident_hour_of_the_day          0
number_of_vehicles_involved       0
property_damage                   0
bodily_injuries                   0
witnesses                         0
police_report_available           0
total_claim_amount                0
injury_claim                      0
property_claim                    0
vehicle_claim                     0
auto_make                         0
auto_model                        0
auto_year                         0
fraud_reported                    0
_c39                           1000
dtype: int64
In [11]:
# variable _c39 has 1000 missing values and has to be dropped

df.drop('_c39', inplace=True, axis=1)
In [12]:
# Unique value count of each categorical feature

df1 = df.select_dtypes(include=[object])
print(df1.nunique())
policy_bind_date            951
policy_state                  3
policy_csl                    3
insured_sex                   2
insured_education_level       7
insured_occupation           14
insured_hobbies              20
insured_relationship          6
incident_date                60
incident_type                 4
collision_type                4
incident_severity             4
authorities_contacted         5
incident_state                7
incident_city                 7
incident_location          1000
property_damage               3
police_report_available       3
auto_make                    14
auto_model                   39
fraud_reported                2
dtype: int64

Step 4: Data Visualization

Data visualization provides a good, organized pictorial representation of the data which makes it easier to understand, observe, analyze. We are able to observe and interpret different patterns and trends which exist in our data.

In [13]:
# histogram of Fraud_reported

px.histogram(df,x="fraud_reported",color='fraud_reported',title="Fraud_reported count")

They are more non fraudulent counts than fraudulent counts

In [14]:
# Pie chart of fraud reported

label_cnts = df.fraud_reported.value_counts()

# Plot value_counts
px.pie(names = ["No Fraud Reported", "Fraud Reported"],values = label_cnts.values,title="Fraud reported",height=700,width=700)

75.3% claims are non fraudulent while 24.7% are fraudulent claims

In [15]:
# Histogram of incident type

px.histogram(df,"incident_type", color= "fraud_reported", title="Incident type")

People with incident type Single Vehicle Collision and Multi Vehicle Collision filed the highest claims

In [16]:
# Histogram of incident type accorded to fraud reported

df2 = df[df['fraud_reported']=='Y']
px.histogram(df2,"incident_type",color="fraud_reported",title="Incident type according to fraud reported")
In [17]:
# Pie chart of incident type with frauds reported

incident_type_count = df2.incident_type.value_counts()
# Plot value_counts
px.pie(names = incident_type_count.index,values = incident_type_count.values,title="incident type according to fraud reported",height=700,width=700)

Single Vehicle collision and Multi-Vehicle Collision have more fraudulent claims than any other claim_type

In [18]:
# Histogram of insured sex

px.histogram(df,"insured_sex",color="fraud_reported",title="Insured sex ")
In [19]:
# Pie chart of incident type with frauds reported

insured_sex_count = df2.insured_sex.value_counts()
# Plot value_counts
px.pie(names = insured_sex_count.index,values = insured_sex_count.values,title="Sex type per fraudulent claims",height=700,width=700)

Surprisingly women committed more fraud that men

In [20]:
# Histogram of eductional level


px.histogram(df, x = "insured_education_level",color="fraud_reported",title="education level")
In [21]:
# Pie chart of insured education level according to fraud reported

insured_education_level_count = df2.insured_education_level.value_counts()
# Plot value_counts
px.pie(names = insured_education_level_count.index,values = insured_education_level_count.values,title="insured education level according to fraudulent reported",height=700,width=700)

Those with the educational level equal to Juris Doctor (JD) have the highest fraud reported claims

In [22]:
# Incident severity


px.histogram(df, x = "incident_severity",color="fraud_reported",title="incident severity")

Those with Minor Damge filed more claims while those with major damage had more fraudulent claims

In [23]:
# Pie chart of incident severity according to fraud reported

incident_severity_count = df2.incident_severity.value_counts()
# Plot value_counts
px.pie(names = incident_severity_count.index,values = incident_severity_count.values,title="incident severity according to fraud",height=700,width=700)

People with major damage incidents filed more fraudulent claims

In [24]:
# hsitogram of insured occupation

px.histogram(df,"insured_occupation",color="fraud_reported",title="Insured occupation")
In [25]:
# Pie chart of insured occupation according to fraud reported

insured_occupation_count = df2.insured_occupation.value_counts()
# Plot value_counts
px.pie(names = insured_occupation_count.index,values = insured_occupation_count.values,title="insured occupation according to fraud",height=700,width=700)

claimants who hold exec managerial positions filed more frudulent claims

In [26]:
# Histogram of authorities contacted

px.histogram(df,"authorities_contacted",color="fraud_reported",title="Authorities contacted")
In [27]:
# Pie chart of insured occupation according to fraud reported

authorities_contacted_count = df2.authorities_contacted.value_counts()
# Plot value_counts
px.pie(names = authorities_contacted_count.index,values = authorities_contacted_count.values,title="Authorities contacted according to fraud",height=700,width=700)

Those with others committed more fraud

In [28]:
# boxplot of age according to fraud_reported

px.box(df,y="age",title="Age")
In [29]:
# boxplot of age according to fraud_reported

px.box(df,y="age", color="fraud_reported",title="Age per fraud")
In [30]:
# boxplot of age according to fraud_reported

px.box(df,y="total_claim_amount",title="Total claim amount")
In [31]:
# boxplot of age according to fraud_reported

px.box(df,y="total_claim_amount",color='fraud_reported', title="Total claim amount according to fraud")
In [32]:
# Box plot of months as customer

px.box(df,y="months_as_customer",title="months_as_customer")
In [33]:
# boxplot of age according to fraud_reported

px.box(df,y="months_as_customer", color="fraud_reported",title="months_as_customer")
In [34]:
# boxplot of age according to fraud_reported

px.box(df,y="witnesses", color="fraud_reported",title="witnesses")

Some variables have a few outliers as seen from the boxplots. Outliers are data points that are distant from other similar point. One way to handle outliers is through imputation or training data with ensemble models like random forest and gradient boosting which can handle outliers

In [35]:
# Correlation plot

corr= df.corr()
#heat map of correlation
plt.figure(figsize=(15,15))
sns.heatmap(corr, annot=True)
plt.title('Correlation Heatmap', fontdict={'fontsize':24}, pad=12)
plt.show()

Step 5: Data Preprocessing

In [36]:
df.sample(10)
Out[36]:
months_as_customer age policy_number policy_bind_date policy_state policy_csl policy_deductable policy_annual_premium umbrella_limit insured_zip insured_sex insured_education_level insured_occupation insured_hobbies insured_relationship capital-gains capital-loss incident_date incident_type collision_type incident_severity authorities_contacted incident_state incident_city incident_location incident_hour_of_the_day number_of_vehicles_involved property_damage bodily_injuries witnesses police_report_available total_claim_amount injury_claim property_claim vehicle_claim auto_make auto_model auto_year fraud_reported
924 135 30 913464 2009-01-21 IN 500/1000 2000 1341.24 0 601701 FEMALE MD farming-fishing skydiving wife 37100 -46500 2015-01-19 Multi-vehicle Collision Rear Collision Minor Damage Ambulance WV Riverwood 9317 Apache Ave 18 3 NO 0 1 NO 32670 5940 2970 23760 Honda Accord 2003 N
270 369 55 577810 2013-04-15 OH 250/500 2000 1589.54 0 444734 MALE College handlers-cleaners camping husband 55400 0 2015-01-27 Multi-vehicle Collision Rear Collision Minor Damage Police VA Arlington 9373 Pine Hwy 6 3 ? 2 0 YES 85300 17060 8530 59710 Toyota Highlander 2003 N
763 66 30 984456 2003-06-24 IN 500/1000 500 484.67 0 608309 FEMALE College adm-clerical paintball wife 21100 -60800 2015-01-24 Multi-vehicle Collision Front Collision Major Damage Fire SC Arlington 2889 Weaver St 2 3 ? 0 2 YES 65560 11920 11920 41720 Volkswagen Passat 2015 Y
777 239 40 488724 2004-11-29 IN 100/300 500 1463.95 0 430567 FEMALE JD sales skydiving own-child 0 0 2015-02-11 Multi-vehicle Collision Rear Collision Total Loss Police NC Springfield 4545 4th Ridge 20 3 ? 0 0 YES 69740 6340 6340 57060 Dodge Neon 2003 N
86 153 34 798177 2006-03-04 IL 500/1000 1000 873.64 4000000 432934 FEMALE Associate priv-house-serv yachting husband 800 0 2015-01-30 Multi-vehicle Collision Front Collision Minor Damage Other SC Columbus 9489 3rd St 9 3 NO 2 1 ? 68400 11400 11400 45600 Ford F150 2007 N
920 214 40 118236 2000-08-15 OH 100/300 1000 1648.00 0 608405 MALE JD transport-moving base-jumping not-in-family 57700 -43500 2015-02-04 Single Vehicle Collision Rear Collision Total Loss Other WV Northbrook 6638 Tree Drive 17 1 NO 1 0 YES 44220 8040 4020 32160 Accura MDX 2000 N
379 157 32 347984 2009-10-21 OH 100/300 2000 617.11 0 436711 MALE College other-service reading other-relative 0 -54100 2015-01-02 Multi-vehicle Collision Front Collision Major Damage Other VA Columbus 6658 Weaver St 14 3 ? 1 2 NO 50800 10160 5080 35560 Mercedes E400 2013 Y
138 325 46 935277 2013-07-09 IL 500/1000 500 1348.83 0 474360 FEMALE High School prof-specialty basketball wife 46300 -77500 2015-02-01 Multi-vehicle Collision Rear Collision Minor Damage Fire NC Springfield 9358 Texas Ridge 21 3 ? 1 2 YES 76120 13840 6920 55360 Toyota Camry 1999 N
778 161 38 192524 2004-01-02 IL 100/300 2000 1133.85 0 439870 MALE PhD priv-house-serv exercise not-in-family 60200 0 2015-01-03 Multi-vehicle Collision Front Collision Total Loss Police WV Springfield 2272 Embaracadero Drive 0 3 YES 2 2 YES 60480 5040 15120 40320 Volkswagen Jetta 2003 N
921 178 38 987524 2014-11-13 IL 250/500 500 1381.14 0 472253 FEMALE College other-service camping wife 0 0 2015-02-22 Multi-vehicle Collision Rear Collision Minor Damage Other NY Northbrook 5678 Lincoln Drive 10 3 NO 0 3 NO 57200 5200 10400 41600 BMW M5 2011 N

The data has an unwanted character '?' which needs to be treated

In [37]:
# Dropping unnecessary variables

df = df.drop(columns = ['policy_number', 'policy_bind_date', 'policy_csl','insured_zip', 'incident_date','incident_location', 'policy_state', 'incident_city', 'insured_relationship', 'auto_make', 'auto_model', 'auto_year'])

It is always good to split data before preprocessing to avoid data leakage

In [38]:
# Splitting data into train and test set

df_train, df_test = train_test_split(df, test_size=0.2, random_state=0)
In [39]:
df_train = df_train.replace('?',np.NaN)
df_test = df_test.replace('?', np.NaN)
In [40]:
df_train.isnull().sum()
Out[40]:
months_as_customer               0
age                              0
policy_deductable                0
policy_annual_premium            0
umbrella_limit                   0
insured_sex                      0
insured_education_level          0
insured_occupation               0
insured_hobbies                  0
capital-gains                    0
capital-loss                     0
incident_type                    0
collision_type                 144
incident_severity                0
authorities_contacted            0
incident_state                   0
incident_hour_of_the_day         0
number_of_vehicles_involved      0
property_damage                288
bodily_injuries                  0
witnesses                        0
police_report_available        270
total_claim_amount               0
injury_claim                     0
property_claim                   0
vehicle_claim                    0
fraud_reported                   0
dtype: int64
In [41]:
# Replacing missing values with a new class 'Missing' 
for col in df_train:
    df_train[col]=df_train[col].fillna('Missing')
    
for col in df_test:
    df_test[col]= df_test[col].fillna('Missing')
In [42]:
df_train.head(5)
Out[42]:
months_as_customer age policy_deductable policy_annual_premium umbrella_limit insured_sex insured_education_level insured_occupation insured_hobbies capital-gains capital-loss incident_type collision_type incident_severity authorities_contacted incident_state incident_hour_of_the_day number_of_vehicles_involved property_damage bodily_injuries witnesses police_report_available total_claim_amount injury_claim property_claim vehicle_claim fraud_reported
687 194 41 500 1203.81 0 MALE JD transport-moving video-games 52500 -51300 Multi-vehicle Collision Rear Collision Minor Damage Police WV 17 3 Missing 0 2 Missing 95900 13700 20550 61650 N
500 1 29 500 854.58 0 FEMALE JD craft-repair paintball 52200 0 Single Vehicle Collision Side Collision Minor Damage Police SC 15 1 Missing 2 3 YES 86790 7890 23670 55230 N
332 85 25 500 1259.02 0 FEMALE JD tech-support base-jumping 67000 -53600 Parked Car Missing Trivial Damage None SC 8 1 NO 2 2 Missing 5640 940 940 3760 N
979 229 37 1000 1331.94 0 FEMALE Masters farming-fishing base-jumping 0 -55400 Single Vehicle Collision Rear Collision Total Loss Other NY 17 1 NO 0 2 YES 54560 9920 9920 34720 N
817 250 42 500 1055.60 0 MALE High School exec-managerial paintball 69500 -40700 Single Vehicle Collision Rear Collision Major Damage Other SC 16 1 Missing 1 1 Missing 74800 13600 6800 54400 Y
In [43]:
# Separating independent variables from the target variable

y_train= df_train.pop('fraud_reported')
X_train = df_train

y_test = df_test.pop('fraud_reported')
X_test = df_test
In [44]:
# Encoding categroical variable with one hot encoder

X_train = pd.get_dummies(X_train)    # Encoding the train set


X_test = pd.get_dummies(X_test)   # Encoding the test set
In [45]:
X_train.head()
Out[45]:
months_as_customer age policy_deductable policy_annual_premium umbrella_limit capital-gains capital-loss incident_hour_of_the_day number_of_vehicles_involved bodily_injuries witnesses total_claim_amount injury_claim property_claim vehicle_claim insured_sex_FEMALE insured_sex_MALE insured_education_level_Associate insured_education_level_College insured_education_level_High School insured_education_level_JD insured_education_level_MD insured_education_level_Masters insured_education_level_PhD insured_occupation_adm-clerical insured_occupation_armed-forces insured_occupation_craft-repair insured_occupation_exec-managerial insured_occupation_farming-fishing insured_occupation_handlers-cleaners insured_occupation_machine-op-inspct insured_occupation_other-service insured_occupation_priv-house-serv insured_occupation_prof-specialty insured_occupation_protective-serv insured_occupation_sales insured_occupation_tech-support insured_occupation_transport-moving insured_hobbies_base-jumping insured_hobbies_basketball insured_hobbies_board-games insured_hobbies_bungie-jumping insured_hobbies_camping insured_hobbies_chess insured_hobbies_cross-fit insured_hobbies_dancing insured_hobbies_exercise insured_hobbies_golf insured_hobbies_hiking insured_hobbies_kayaking insured_hobbies_movies insured_hobbies_paintball insured_hobbies_polo insured_hobbies_reading insured_hobbies_skydiving insured_hobbies_sleeping insured_hobbies_video-games insured_hobbies_yachting incident_type_Multi-vehicle Collision incident_type_Parked Car incident_type_Single Vehicle Collision incident_type_Vehicle Theft collision_type_Front Collision collision_type_Missing collision_type_Rear Collision collision_type_Side Collision incident_severity_Major Damage incident_severity_Minor Damage incident_severity_Total Loss incident_severity_Trivial Damage authorities_contacted_Ambulance authorities_contacted_Fire authorities_contacted_None authorities_contacted_Other authorities_contacted_Police incident_state_NC incident_state_NY incident_state_OH incident_state_PA incident_state_SC incident_state_VA incident_state_WV property_damage_Missing property_damage_NO property_damage_YES police_report_available_Missing police_report_available_NO police_report_available_YES
687 194 41 500 1203.81 0 52500 -51300 17 3 0 2 95900 13700 20550 61650 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 0
500 1 29 500 854.58 0 52200 0 15 1 2 3 86790 7890 23670 55230 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 1
332 85 25 500 1259.02 0 67000 -53600 8 1 2 2 5640 940 940 3760 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0
979 229 37 1000 1331.94 0 0 -55400 17 1 0 2 54560 9920 9920 34720 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 1
817 250 42 500 1055.60 0 69500 -40700 16 1 1 1 74800 13600 6800 54400 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 1 0 0
In [46]:
# encoding the target with label encoding

from sklearn.preprocessing import LabelEncoder

labelencoder = LabelEncoder()

y_train = labelencoder.fit_transform(y_train)

y_test = labelencoder.transform(y_test)
In [47]:
X_train.shape, X_test.shape  # Shape of train and test set
Out[47]:
((800, 88), (200, 88))
In [48]:
# Scaling train and test set with MinMaxScaler

# MinMaxScaler transform the variables into a range of [0, 1]

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

X_train = pd.DataFrame(scaler.fit_transform(X_train), columns = X_train.columns)

X_test = pd.DataFrame(scaler.transform(X_test), columns = X_test.columns)

The dataset in imbalance, SMOTE will help to balance up the classes

In [49]:
# Treating imbalance data in training dataset

from imblearn.over_sampling import SMOTE

from collections import Counter

counter = Counter(y_train)

print('before smoting: ', counter)

smt = SMOTE()

X_train, y_train = smt.fit_resample(X_train, y_train)

counter = Counter(y_train)
print('After smoting: ', counter)
before smoting:  Counter({0: 610, 1: 190})
After smoting:  Counter({0: 610, 1: 610})

Step 6: Modeling Building and Evaluation

I will apply the following supervised learning models.

(1): Logistic Regression

(2): Decision Tree

(3): Random Forest

(4): Gradient Boosting

In [50]:
index = ["LogisticRegression", "DecisionTreeClassifier", "GradientBoostingClassifier", "RandomForestClassifier"]
results = pd.DataFrame(columns=['Accuracy','Precison','Recall', 'f1_score', 'AUC'],index=index)

Logistic Regression

Logistic regression is a linear model for classification. It is simple to implement, efficient and fast.

In [51]:
# set tuning paramters
from sklearn.linear_model import LogisticRegression 

LR = LogisticRegression(C=0.1, penalty='l1', solver='liblinear')

# Fitting model to train set

LR.fit( X_train, y_train) 

# Checking for overfitting and underfitting

print(" Accuracy on training set: ",  LR.score( X_train, y_train)) 

print(" Accuracy on test set: ", LR.score( X_test, y_test))
 Accuracy on training set:  0.8778688524590164
 Accuracy on test set:  0.855

Predction on test set

In [52]:
# Prediction on test set

y_pred = LR.predict(X_test)
print(y_pred)
[0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 1 1 1 1 0 0 0 0 1 1 0 0 0 0 0
 0 0 0 0 1 1 0 1 0 0 0 0 1 1 0 1 1 0 0 1 0 1 1 0 1 0 0 1 0 0 1 0 0 0 1 0 1
 0 0 1 1 0 0 1 0 1 1 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 1 1 1 0 0 0 1 0 1 1 0
 0 1 0 1 0 1 0 1 1 0 0 1 1 1 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 1 1 1 0 0
 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 1 0 1 0 0 0 1 0 1 0 0
 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0]

Evaluating Model Performance

Model evaluation is the process of evaluating a trained model on the test data set.

Confusion Matrix

A confusion matrix is a table use to evaluate the performance of classification model

It contains the following:

True Positive (TP): Correct positive prediction

False Positive (FP): Incorrect positive prediction

True negative (TN): Correct negative prediction

False negative (FN): Incorrect negative prediction

Recall or sensitivity: Recall is the true positive predictions out of the total number of actual positive

Recall = TP/ (TP +FN)

Precision: It represent the number of true positive predictions divide by the total number of positive predictions.

Precision= TP/ (TP+FP)

Accuracy: Accuracy is the total number of correct predictions divide by the total predictions

Accuracy = (TP+TN) / (TP+TN+FN+FP)

f1_score: It is the harmonic mean of recall and precision.

f1_score = (2X Precision X Recall)/ (Precision + Recall)

Area Under Curve (AUC)

It represents the area under the ROC curve. An ROC curve is a plot of false positive rate on the x-axis and true positive rate on the y-axis

In [53]:
# Confusion matrix

cm = confusion_matrix(y_test, y_pred)
cm
Out[53]:
array([[123,  20],
       [  9,  48]], dtype=int64)
In [54]:
# Heatmap

plt.figure(figsize=(8, 5))
sns.heatmap(cm, cmap= 'Blues', linecolor='black', fmt='', annot=True)
plt.title('confusion matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
In [55]:
# Accuracy of test

from sklearn.metrics import recall_score, precision_score, accuracy_score, f1_score

rec = recall_score(y_test, y_pred)
pre = precision_score(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)
f1_sc =  f1_score(y_test, y_pred)

print("Accuracy :: ",acc)
print("Precision :: ",pre)
print("Recall :: ", rec)
print("f1_score", f1_sc)
Accuracy ::  0.855
Precision ::  0.7058823529411765
Recall ::  0.8421052631578947
f1_score 0.7679999999999999
In [56]:
# ROC Curve, AUC

from sklearn import metrics

y_pred_proba = LR.predict_proba(X_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test,  y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
print('auc: ', auc)

#create ROC curve
plt.plot(fpr,tpr,label="AUC="+str(auc))
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.title('ROC')
plt.legend(loc=4)
plt.show()
auc:  0.8535762483130904
In [57]:
# calssification report

print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.93      0.86      0.89       143
           1       0.71      0.84      0.77        57

    accuracy                           0.85       200
   macro avg       0.82      0.85      0.83       200
weighted avg       0.87      0.85      0.86       200

In [58]:
results.loc["LogisticRegression"] = [acc,pre,rec, f1_sc, auc]

Decision Tree

Decision trees are a widely used model for classification and regression problems. A decision tree is made up of the decision node and leaf node. Decision nodes are used to make any decision and have multiple branches, whereas Leaf nodes are the output of those decisions and do not contain any further branches.

In [59]:
# set tuning paramters

tree = DecisionTreeClassifier(max_depth=2, random_state=0) 

# Fitting model to train set

tree.fit( X_train, y_train) 

# Checking for overfitting and underfitting

print(" Accuracy on training set: ",  tree.score( X_train, y_train)) 

print(" Accuracy on test set: ", tree.score( X_test, y_test))
 Accuracy on training set:  0.8729508196721312
 Accuracy on test set:  0.835
In [60]:
# Prediction on test set

y_pred1 = tree.predict(X_test)
print(y_pred)
[0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 1 1 1 1 0 0 0 0 1 1 0 0 0 0 0
 0 0 0 0 1 1 0 1 0 0 0 0 1 1 0 1 1 0 0 1 0 1 1 0 1 0 0 1 0 0 1 0 0 0 1 0 1
 0 0 1 1 0 0 1 0 1 1 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 1 1 1 0 0 0 1 0 1 1 0
 0 1 0 1 0 1 0 1 1 0 0 1 1 1 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 1 1 1 0 0
 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 1 0 1 0 0 0 1 0 1 0 0
 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0]
In [61]:
rec1 = recall_score(y_test, y_pred1)
pre1 = precision_score(y_test, y_pred1)
acc1 = accuracy_score(y_test, y_pred1)
f1_sc1 =  f1_score(y_test, y_pred1)

print("Accuracy :: ",acc1)
print("Precision :: ",pre1)
print("Recall :: ", rec1)
print("f1_score", f1_sc1)
Accuracy ::  0.835
Precision ::  0.6875
Recall ::  0.7719298245614035
f1_score 0.7272727272727273
In [62]:
# Confusion matrix

cm = confusion_matrix(y_test, y_pred)
cm
Out[62]:
array([[123,  20],
       [  9,  48]], dtype=int64)
In [63]:
# Heatmap

plt.figure(figsize=(8, 5))
sns.heatmap(cm, cmap= 'Blues', linecolor='black', fmt='', annot=True)
plt.title('confusion matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
In [64]:
# ROC Curve, AUC

from sklearn import metrics

y_pred_proba = tree.predict_proba(X_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test,  y_pred_proba)
auc1 = metrics.roc_auc_score(y_test, y_pred_proba)
print('auc: ', auc)

#create ROC curve
plt.plot(fpr,tpr,label="AUC="+str(auc))
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.title('ROC')
plt.legend(loc=4)
plt.show()
auc:  0.8535762483130904
In [65]:
# calssification report

print(classification_report(y_test, y_pred1))
              precision    recall  f1-score   support

           0       0.90      0.86      0.88       143
           1       0.69      0.77      0.73        57

    accuracy                           0.83       200
   macro avg       0.80      0.82      0.80       200
weighted avg       0.84      0.83      0.84       200

In [66]:
# Feature importance

plt.figure(figsize=(15, 5))
importances = tree.feature_importances_
feature_importance = pd.Series(importances, index = X_train.columns)
feature_importance.plot(kind='bar')
plt.title('Feature importance')
plt.show()
In [67]:
results.loc["DecisionTreeClassifier"] = [acc1,pre1,rec1, f1_sc1, auc1]

Random Forest

A random forest is essentially a collection of decision trees, where each tree is slightly different from the others. The idea behind random forests is that each tree might do a relatively good job of predicting, but will likely overfit on part of the data. If we build many trees, all of which work well and overfit in different ways, we can reduce the amount of overfitting by averaging their results.

In [68]:
from sklearn.ensemble import RandomForestClassifier

RF = RandomForestClassifier(max_depth=2, n_estimators=5, random_state=0)

# Fitting model to train set

RF.fit( X_train, y_train) 

# Checking for overfitting and underfitting

print(" Accuracy on training set: ",  RF.score( X_train, y_train)) 

print(" Accuracy on test set: ", RF.score( X_test, y_test))
 Accuracy on training set:  0.8672131147540983
 Accuracy on test set:  0.83
In [69]:
# Prediction on test set

y_pred3 = RF.predict(X_test)
print(y_pred)
[0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 1 1 1 1 0 0 0 0 1 1 0 0 0 0 0
 0 0 0 0 1 1 0 1 0 0 0 0 1 1 0 1 1 0 0 1 0 1 1 0 1 0 0 1 0 0 1 0 0 0 1 0 1
 0 0 1 1 0 0 1 0 1 1 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 1 1 1 0 0 0 1 0 1 1 0
 0 1 0 1 0 1 0 1 1 0 0 1 1 1 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 1 1 1 0 0
 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 1 0 1 0 0 0 1 0 1 0 0
 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0]
In [70]:
rec2 = recall_score(y_test, y_pred3)
pre2 = precision_score(y_test, y_pred3)
acc2 = accuracy_score(y_test, y_pred3)
f1_sc2 =  f1_score(y_test, y_pred3)

print("Accuracy :: ",acc2)
print("Precision :: ",pre2)
print("Recall :: ", rec2)
print("f1_score", f1_sc2)
Accuracy ::  0.83
Precision ::  0.6825396825396826
Recall ::  0.7543859649122807
f1_score 0.7166666666666668
In [71]:
# Confusion matrix

cm = confusion_matrix(y_test, y_pred3)
cm
Out[71]:
array([[123,  20],
       [ 14,  43]], dtype=int64)
In [72]:
# Heatmap

plt.figure(figsize=(8, 5))
sns.heatmap(cm, cmap= 'Blues', linecolor='black', fmt='', annot=True)
plt.title('confusion matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
In [73]:
# ROC Curve, AUC

from sklearn import metrics

y_pred_proba = tree.predict_proba(X_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test,  y_pred_proba)
auc2 = metrics.roc_auc_score(y_test, y_pred_proba)
print('auc: ', auc)

#create ROC curve
plt.plot(fpr,tpr,label="AUC="+str(auc))
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.title('ROC')
plt.legend(loc=4)
plt.show()
auc:  0.8535762483130904
In [74]:
# calssification report

print(classification_report(y_test, y_pred3))
              precision    recall  f1-score   support

           0       0.90      0.86      0.88       143
           1       0.68      0.75      0.72        57

    accuracy                           0.83       200
   macro avg       0.79      0.81      0.80       200
weighted avg       0.84      0.83      0.83       200

In [75]:
# Feature importance

plt.figure(figsize=(15, 5))
importances = RF.feature_importances_
feature_importance = pd.Series(importances, index = X_train.columns)
feature_importance.plot(kind='bar')
plt.title('Feature importance')
plt.show()
In [76]:
results.loc["RandomForestClassifier"] = [acc2,pre2,rec2, f1_sc2, auc2]

Gradient Boosting

Gradient boosting machines is an ensemble method that combines multiple decision trees to create a more powerful model. Despite the “regression” in the name, these models can be used for regression and classification.

In [77]:
from sklearn.ensemble import GradientBoostingClassifier

gbrt = GradientBoostingClassifier(max_depth=2, learning_rate=0.01, random_state=0)

# Fitting model to train set

gbrt.fit( X_train, y_train) 

# Checking for overfitting and underfitting

print(" Accuracy on training set: ",  gbrt.score( X_train, y_train)) 

print(" Accuracy on test set: ", gbrt.score( X_test, y_test))
 Accuracy on training set:  0.8696721311475409
 Accuracy on test set:  0.84
In [78]:
# Prediction on test set

y_pred4 = gbrt.predict(X_test)
print(y_pred4)
[0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 1 1 1 1 0 0 0 0 1 1 0 0 0 0 0
 0 0 0 0 1 1 0 1 0 0 0 0 1 1 0 1 1 0 0 1 0 1 1 0 1 0 0 1 0 0 1 0 0 0 1 0 1
 0 0 1 1 0 0 1 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0 1 1 1 0 0 0 1 0 1 0 0
 0 1 0 1 0 1 0 0 1 0 0 1 1 1 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0 1 0 0 1 1 1 0 0
 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 1 0 1 0 0 0 1 0 1 0 0
 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0]
In [79]:
rec3 = recall_score(y_test, y_pred4)
pre3 = precision_score(y_test, y_pred4)
acc3 = accuracy_score(y_test, y_pred4)
f1_sc3 =  f1_score(y_test, y_pred4)

print("Accuracy :: ",acc)
print("Precision :: ",pre)
print("Recall :: ", rec)
print("f1_score", f1_sc)
Accuracy ::  0.855
Precision ::  0.7058823529411765
Recall ::  0.8421052631578947
f1_score 0.7679999999999999
In [80]:
cm = confusion_matrix(y_test, y_pred4)
cm
Out[80]:
array([[123,  20],
       [ 12,  45]], dtype=int64)
In [81]:
# Heatmap

plt.figure(figsize=(8, 5))
sns.heatmap(cm, cmap= 'Blues', linecolor='black', fmt='', annot=True)
plt.title('confusion matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
In [82]:
# ROC Curve, AUC

from sklearn import metrics

y_pred_proba = gbrt.predict_proba(X_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test,  y_pred_proba)
auc3 = metrics.roc_auc_score(y_test, y_pred_proba)
print('auc: ', auc)

#create ROC curve
plt.plot(fpr,tpr,label="AUC="+str(auc))
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.title('ROC')
plt.legend(loc=4)
plt.show()
auc:  0.8535762483130904
In [83]:
# calssification report

print(classification_report(y_test, y_pred4))
              precision    recall  f1-score   support

           0       0.91      0.86      0.88       143
           1       0.69      0.79      0.74        57

    accuracy                           0.84       200
   macro avg       0.80      0.82      0.81       200
weighted avg       0.85      0.84      0.84       200

In [84]:
results.loc["GradientBoostingClassifier"] = [acc3,pre3,rec3, f1_sc3, auc3]
In [85]:
# Feature importance

plt.figure(figsize=(15, 5))
importances = gbrt.feature_importances_
feature_importance = pd.Series(importances, index = X_train.columns)
feature_importance.plot(kind='bar')
plt.title('Feature importance')
plt.show()

Step 7: Model Comparison

In [86]:
results = results*100
results
Out[86]:
Accuracy Precison Recall f1_score AUC
LogisticRegression 85.5 70.5882 84.2105 76.8 85.3576
DecisionTreeClassifier 83.5 68.75 77.193 72.7273 83.0941
GradientBoostingClassifier 84 69.2308 78.9474 73.7705 83.0144
RandomForestClassifier 83 68.254 75.4386 71.6667 83.0941
In [87]:
px.bar(results,y ="Accuracy",x = results.index,color = results.index,title="Accuracy Comparison")
In [88]:
px.bar(results,y ="Precison",x = results.index,color = results.index,title="Precision Comparison")
In [89]:
px.bar(results,y ="Recall",x = results.index,color = results.index,title="Recall Comparison")
In [90]:
px.bar(results,y ="f1_score",x = results.index,color = results.index,title="f1_score comparison")

Step 8: Pros and Cons of the Different Models

Logistic regression

Pros

• Simple algorithm that is easy to implement, does not require high computation power.

• Performs extremely well when the data is linearly separable.

• Less prone to over-fitting, with low-dimensional data.

Cons

• Poor performance on non-linear data

• Poor performance with highly correlated features.

Decision Tree

Pros

• Do not need to scale and normalize data.

• Handles missing values very well.

• Less effort in regard to preprocessing.

Cons

• Very prone to overfitting.

• Sensitive to outliers and changes in the data.

Random forest

Pros

• It reduces risk of overfitting since it is an ensemble of decision trees. For predicting the outcome, random forest takes inputs from all the trees and then predicts the outcome

• Excellent handling of missing data.

• Good Performance on Imbalanced datasets. It can also handle errors in imbalanced data (one class is majority and other class is minority).

• It can handle huge amount of data with higher dimensionality of variables.

• Little impact of outliers

• Useful to extract feature importance (can be used for feature selection).

Cons

• Random forests do not tend to perform well on very high dimensional, sparse data, such as text data.

Gradient boosting

Pros

• Less feature engineering required (No need for scaling, normalizing data, can also handle missing values well.

• Fast to interpret

• Outliers have minimal impact.

• Good model performance

• Less prone to overfitting

Cons

• Overfitting possible if parameters not tuned properly.

In [ ]: